
    Conditional Dependencies: A Principled Approach to Improving Data Quality

    Abstract. Real-life data is often dirty and costs businesses billions of pounds worldwide each year. This paper presents a promising approach to improving data quality. It effectively detects and fixes inconsistencies in real-life data based on conditional dependencies, an extension of database dependencies that enforces bindings of semantically related data values. It accurately identifies records from unreliable data sources by leveraging relative candidate keys, an extension of keys for relations that supports similarity and matching operators across relations. In contrast to traditional dependencies, which were developed to improve the quality of schemas, the revised constraints are proposed to improve the quality of data. These constraints yield practical techniques for data repairing and record matching in a uniform framework.
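    To make the idea concrete, here is a minimal sketch (not the paper's implementation) of checking one conditional functional dependency. The rule and records are hypothetical: for tuples where country is "UK", the same zip must always map to the same city, i.e. zip determines city conditioned on country.

```python
# Sketch of detecting violations of a conditional functional dependency (CFD).
# Hypothetical rule: within country == "UK", zip -> city must hold.

def find_cfd_violations(records, condition, lhs, rhs):
    """Return pairs of records violating lhs -> rhs among records matching condition."""
    seen = {}        # lhs value -> (first rhs value observed, the record it came from)
    violations = []
    for rec in records:
        if not condition(rec):
            continue                      # rule only binds records meeting the condition
        key = rec[lhs]
        if key in seen and seen[key][0] != rec[rhs]:
            violations.append((seen[key][1], rec))
        elif key not in seen:
            seen[key] = (rec[rhs], rec)
    return violations

records = [
    {"country": "UK", "zip": "EH8", "city": "Edinburgh"},
    {"country": "UK", "zip": "EH8", "city": "London"},   # inconsistent with the rule
    {"country": "US", "zip": "EH8", "city": "Boston"},   # condition not met, ignored
]
bad = find_cfd_violations(records, lambda r: r["country"] == "UK", "zip", "city")
```

    A repair step would then resolve each violating pair, e.g. by trusting the more reliable source; the paper's framework handles repair and matching together, which this sketch does not attempt.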

    Multi-source statistics: Basic situations and methods

    Many National Statistical Institutes (NSIs), especially in Europe, are moving from single‐source statistics to multi‐source statistics. By combining data sources, NSIs can produce more detailed and more timely statistics and respond more quickly to events in society. By combining survey data with already available administrative data and Big Data, NSIs can save data collection and processing costs and reduce the burden on respondents. However, multi‐source statistics come with new problems that need to be overcome before the resulting output quality is sufficiently high and before those statistics can be produced efficiently. What complicates the production of multi‐source statistics is that they come in many different varieties, as data sets can be combined in many different ways. Given the rapidly increasing importance of producing multi‐source statistics in Official Statistics, there has been considerable research activity in this area over the last few years, and some frameworks have been developed for multi‐source statistics. Useful as these frameworks are, they generally do not give guidelines as to which method could be applied in a given situation arising in practice. In this paper, we aim to fill that gap, structure the world of multi‐source statistics and its problems, and provide some guidance on suitable methods for these problems.
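    The basic combination the abstract describes, enriching survey responses with already available administrative data via a shared unit identifier, can be sketched as follows. The identifiers and variables here are invented for illustration only.

```python
# Toy illustration of linking a survey to an administrative register on a
# shared unit id, so survey variables are enriched with admin variables
# without collecting them again from respondents.
survey = {"u1": {"employed": True}, "u2": {"employed": False}}
admin  = {"u1": {"income": 30000}, "u3": {"income": 45000}}

# Keep only units present in both sources; merge their variables.
linked = {
    uid: {**survey[uid], **admin[uid]}
    for uid in survey.keys() & admin.keys()
}
```

    In practice the hard problems start exactly where this sketch stops: units missing from one source, conflicting values between sources, and identifiers that match only approximately, which is why the paper maps situations to suitable methods rather than prescribing a single recipe.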

    Dynamic Similarity-Aware Inverted Indexing for Real-Time Entity Resolution

    Abstract. Entity resolution is the process of identifying groups of records in a single or multiple data sources that represent the same real-world entity. It is an important tool in data de-duplication, in linking records across databases, and in matching query records against a database of existing entities. Most existing entity resolution techniques complete the resolution process offline and on static databases. However, real-world databases are often dynamic, and increasingly organizations need to resolve entities in real-time. Thus, there is a need for new techniques that facilitate working with dynamic databases in real-time. In this paper, we propose a dynamic similarity-aware inverted indexing technique (DySimII) that meets these requirements. We also propose a frequency-filtered indexing technique where only the most frequent attribute values are indexed. We experimentally evaluate our techniques on a large real-world voter database. The results show that when the index size grows no appreciable increase is found in the average record insertion time (around 0.1 msec) and in the average query time (less than 0.1 sec). We also find that applying the frequency-filtered approach reduces the index size with only a slight drop in recall.
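    The core data structure can be illustrated with a toy inverted index that supports incremental insertion and querying. This is a simplified sketch, not DySimII itself: where DySimII precomputes similarities between encoded attribute values, this version crudely stands in for similarity by also indexing a short prefix of each value, so near-matches can share index entries. All names and records are hypothetical.

```python
from collections import defaultdict

class SimpleInvertedIndex:
    """Toy dynamic inverted index for real-time candidate generation.
    Each record is indexed under its exact attribute values and under a
    3-character prefix key (a crude stand-in for similarity encoding)."""

    def __init__(self):
        self.index = defaultdict(set)   # index key -> set of record ids
        self.records = {}               # record id -> record

    def _keys(self, value):
        v = str(value).lower()
        return {v, v[:3]}               # exact key plus cheap "similar" key

    def insert(self, rec_id, record):
        """Add a record incrementally; no rebuild of the index is needed."""
        self.records[rec_id] = record
        for value in record.values():
            for k in self._keys(value):
                self.index[k].add(rec_id)

    def query(self, record):
        """Return ids of all indexed records sharing at least one key."""
        candidates = set()
        for value in record.values():
            for k in self._keys(value):
                candidates |= self.index[k]
        return candidates

idx = SimpleInvertedIndex()
idx.insert("r1", {"name": "smith", "zip": "2600"})
idx.insert("r2", {"name": "jones", "zip": "2601"})
```

    A query for {"name": "smithe"} finds record r1 through the shared prefix key "smi" even though the exact value differs. The frequency-filtered variant in the paper would additionally drop rarely occurring keys from the index to bound its size.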